{ "cells": [ { "cell_type": "markdown", "id": "dressed-storm", "metadata": {}, "source": [ "# Handling Text Data\n", "\n", "Below are a few examples of how to play with text data. We'll walk through some exercises with these in class!" ] }, { "cell_type": "code", "execution_count": 99, "id": "prepared-roads", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "text_data = pd.read_csv(\"pa3_orig/Bills Mafia.csv\")" ] }, { "cell_type": "code", "execution_count": 100, "id": "neural-server", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | id | text | label |
|---|---|---|---|
| 0 | 1423946923560640514 | I haven’t seen a single story about a vaccinat... | NaN |
| 1 | 1415395068102467588 | WHAT IS GRAPHENE OXIDE? Main Ingredient in Pfi... | NaN |
| 2 | 1395622329376444416 | MO: Vaccine appointments available at Walgreen... | NaN |
| 3 | 1378272239687065610 | PETITION: No to mandatory vaccination for the ... | NaN |
| 4 | 1425352057050091521 | CDC açıkladı: Moderna ve Pfizer-BioNTech aşısı... | NaN |

| | created_utc | is_crosspostable | is_self | is_video | locked | media_only | over_18 | score | subreddit_id | subreddit_name_prefixed | ... | title | permalink | total_awards_received | downs | gilded | num_comments | num_crossposts | num_reports | ups | author_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.582163e+09 | False | True | False | False | False | False | 146606 | t5_2qh72 | r/Jokes | ... | Sad News: The founder of /r/jokes has passed away | /r/Jokes/comments/f6lii3/sad_news_the_founder_... | 200 | 0 | 5 | 1699 | 9 | NaN | 146606 | error521 |
| 1 | 1.511295e+09 | False | True | False | False | False | False | 137607 | t5_2qh72 | r/Jokes | ... | Calm down about the Net Neutrality thing... | /r/Jokes/comments/7ekt23/calm_down_about_the_n... | 15 | 0 | 2 | 1614 | 2 | NaN | 137607 | Victorinox2 |
| 2 | 1.499278e+09 | False | True | False | False | False | False | 108795 | t5_2qh72 | r/Jokes | ... | V | /r/Jokes/comments/6lfqep/v/ | 29 | 0 | 7 | 1360 | 1 | NaN | 108795 | MadGo |
| 3 | 1.565449e+09 | False | True | False | False | False | True | 105444 | t5_2qh72 | r/Jokes | ... | If your surprised that Jeffrey Epstein commite... | /r/Jokes/comments/coj45m/if_your_surprised_tha... | 48 | 0 | 11 | 2418 | 7 | NaN | 105444 | williseeyoutonight |
| 4 | 1.539007e+09 | False | True | False | False | False | False | 100954 | t5_2qh72 | r/Jokes | ... | A new Navy recruit has his first day on the su... | /r/Jokes/comments/9mf1cz/a_new_navy_recruit_ha... | 25 | 0 | 9 | 772 | 6 | NaN | 100954 | Ckarini |

5 rows × 21 columns

| | feature | PC1 | PC2 |
|---|---|---|---|
| 4327 | wp | 0.304592 | -0.309737 |
| 4356 | years | 0.338600 | -0.272100 |
| 2703 | old | 0.220882 | -0.183783 |
| 2132 | just | 0.177622 | -0.171714 |
| 4350 | year | 0.221643 | -0.156942 |
| ... | ... | ... | ... |
| 3795 | suggests | 0.065526 | 0.089920 |
| 1537 | finds | 0.068925 | 0.094339 |
| 2628 | new study | 0.129852 | 0.181214 |
| 3765 | study | 0.195533 | 0.267339 |
| 2624 | new | 0.474328 | 0.650516 |

4385 rows × 3 columns
\n", "